Deep learning models for abnormality classification can perform well in
screening mammography, but the demographic and imaging characteristics
associated with an increased risk of model failure
remain unclear. This retrospective study used data from
the Emory BrEast Imaging Dataset (EMBED), which includes mammograms from 115,931
patients imaged at Emory University Healthcare between 2013 and 2020. Clinical
and imaging data include Breast Imaging Reporting and Data System (BI-RADS)
assessment, region of interest coordinates for abnormalities, imaging features,
pathologic outcomes, and patient demographics. Deep learning models including
InceptionV3, VGG16, ResNet50V2, and ResNet152V2 were developed to distinguish
between patches of abnormal tissue and randomly selected patches of normal
tissue from the screening mammograms. The training, validation, and test sets
comprised 29,144 (55.6%) patches from 10,678 (54.2%) patients,
9,910 (18.9%) patches from 3,609 (18.3%) patients, and 13,390 (25.5%) patches from
5,404 (27.5%) patients, respectively. We assessed model performance overall and within
subgroups defined by age, race, pathologic outcome, and imaging characteristics
to evaluate reasons for misclassifications. On the test set, a ResNet152V2
model trained to classify normal versus abnormal tissue patches achieved an
accuracy of 92.6% (95%CI=92.0-93.2%) and an area under the receiver operating
characteristic curve of 0.975 (95%CI=0.972-0.978). Imaging characteristics
associated with a higher risk of misclassification included higher tissue
density (BI-RADS density C: risk ratio [RR]=1.649, p=.010; BI-RADS density D:
RR=2.026, p=.003) and the presence of architectural distortion (RR=1.026;
p<.001). Small but statistically significant differences in performance were
observed by age, race, pathologic outcome, and other imaging features (p<.001).
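
The risk ratios reported above compare the misclassification rate in a subgroup with the rate in a reference group. The abstract does not state how the ratios or p-values were computed, so the following is only a minimal sketch of one common approach: a risk ratio with a Wald confidence interval on the log scale. The function name `risk_ratio` and all counts are hypothetical and are not taken from the study.

```python
import math

def risk_ratio(err_group, n_group, err_ref, n_ref, z=1.96):
    """Risk ratio of misclassification (group vs. reference) with a
    Wald confidence interval computed on the log scale.

    All arguments are counts: misclassified patches and total patches
    in the subgroup and in the reference group. Counts must be nonzero.
    """
    risk_group = err_group / n_group
    risk_ref = err_ref / n_ref
    rr = risk_group / risk_ref
    # Standard error of log(RR) for two independent binomial samples
    se_log = math.sqrt(1 / err_group - 1 / n_group + 1 / err_ref - 1 / n_ref)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper

# Hypothetical counts for illustration only (not data from the study):
# 120 of 1,000 patches misclassified in the subgroup,
# 60 of 1,000 patches misclassified in the reference group.
rr, lower, upper = risk_ratio(120, 1000, 60, 1000)
print(f"RR={rr:.3f} (95% CI {lower:.3f}-{upper:.3f})")
```

A ratio above 1 with an interval that excludes 1 would indicate that patches in the subgroup are misclassified more often than patches in the reference group.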